Linear Algebra for Data: Vectors, Matrices, Eigenvalues, SVD, and Distance Measures

1) Vectors: The Fundamental Data Object

1.1 What is a vector?

A vector is a 1D array of numbers. You can think of it as:

  • A list of features for one data point (e.g., height, weight, age).
  • A point in space (2D, 3D, or higher dimensions).
  • An arrow with direction and length (geometric view).

Notation: a vector is often written as x. The i-th entry is x_i.

1.2 Geometric picture (vector as an arrow)

In 2D, a vector x = (x_1, x_2) is an arrow from the origin to the point (x_1, x_2).

Diagram: a vector in 2D, drawn as an arrow from the origin (0,0) to the point (x₁, x₂).

1.3 Norm (length) of a vector

The norm measures vector length. The most common is the Euclidean (L2) norm:
\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.

1.4 Unit vectors

A unit vector has length 1: \|u\|_2 = 1.
Any non-zero vector x can be turned into a unit vector by:
u = \frac{x}{\|x\|_2}.

In 2D, unit vectors can be parameterized by an angle \theta:
u = (\cos\theta, \sin\theta).
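These ideas are easy to check numerically. A minimal sketch in NumPy (an illustrative choice; the math above is library-agnostic):

```python
import numpy as np

x = np.array([3.0, 4.0])

# Euclidean (L2) norm: sqrt(3^2 + 4^2) = 5
norm = np.linalg.norm(x)
print(norm)                                  # 5.0

# Normalize to a unit vector: u = x / ||x||_2
u = x / norm
print(np.isclose(np.linalg.norm(u), 1.0))    # True
```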

1.5 Scalar multiplication and vector addition

  • Scalar multiplication: a x stretches (or shrinks) the vector. If a<0, it also flips direction.
  • Vector addition: x + y adds component-wise and corresponds to “tip-to-tail” addition geometrically.

1.6 Inner product (dot product)

The inner product between vectors x and y is:
x^\top y = \sum_{i=1}^{n} x_i y_i.

Geometrically, it connects to the angle \theta between them:
x^\top y = \|x\|_2 \|y\|_2 \cos\theta.

1.7 Projection and orthogonality

If u is a unit vector, the scalar projection length of x onto u is:
\text{proj length} = x^\top u.
The projected vector is:
\text{proj}_u(x) = (x^\top u)\,u.

Two vectors are orthogonal (perpendicular) if their inner product is zero:
x^\top y = 0.
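A small NumPy sketch of the dot product, the angle formula, and projection (the variable names here are just for illustration):

```python
import numpy as np

x = np.array([2.0, 2.0])
y = np.array([1.0, 0.0])          # already a unit vector

dot = x @ y                       # 2*1 + 2*0 = 2
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.degrees(np.arccos(cos_theta))   # angle between x and y: 45 degrees (up to rounding)

# Projection of x onto the unit vector y: (x^T y) y
proj = (x @ y) * y
print(proj)                                # [2. 0.]

# Orthogonality: (0,1) is perpendicular to (1,0)
print(np.array([0.0, 1.0]) @ y)            # 0.0
```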


2) Vector Spaces, Span, Linear Independence, and Bases

2.1 Vector space (big idea)

A vector space is a set of vectors where you can add vectors and scale them, and you stay inside the set.
For data work, \mathbb{R}^n (all real vectors with n components) is the most common.

2.2 Span

A set of vectors \{v_1, v_2, \dots, v_k\} spans a space if any vector in that space can be written as a linear combination:
x = a_1 v_1 + a_2 v_2 + \cdots + a_k v_k.

2.3 Linear independence

Vectors are linearly independent if the only way to make:
a_1 v_1 + \cdots + a_k v_k = 0
is by using a_1=\cdots=a_k=0.
Intuition: none of the vectors is “redundant” (no vector can be built from the others).

2.4 Basis and standard basis

A basis is a set of vectors that:

  1. Spans the space
  2. Is linearly independent

The standard basis for \mathbb{R}^n is the set of unit vectors along each axis:
e_1=(1,0,\dots,0),\ e_2=(0,1,0,\dots,0),\ \dots,\ e_n=(0,\dots,0,1).
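A practical way to test linear independence is to stack the vectors as matrix columns and check the rank; a NumPy sketch (`matrix_rank` is one reasonable tool for this):

```python
import numpy as np

# Columns are the candidate vectors in R^3
V = np.column_stack([(1, 0, 0), (0, 1, 0), (1, 1, 0)])

# Rank < number of vectors -> linearly dependent
print(np.linalg.matrix_rank(V))    # 2: the third column is the sum of the first two

# The standard basis is independent and spans R^3
E = np.eye(3)
print(np.linalg.matrix_rank(E))    # 3
```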


3) Matrices: Linear Transformations and Data Tables

3.1 What is a matrix?

A matrix is a 2D array of values. We write A for a matrix, and A_{ij} is the entry in row i, column j.
Higher-dimensional arrays are often called tensors.

3.2 Basic operations

  • Addition/subtraction: elementwise (only if same shape).
  • Transpose: A^\top swaps rows and columns.
  • Multiplication: defined when the number of columns of A equals the number of rows of B.

If A is m\times n and B is n\times p, then C=AB is m\times p with:
C_{ij}=\sum_{k=1}^{n} A_{ik}B_{kj}.

Diagram: Shapes in matrix multiplication

A (m×n) · B (n×p) = C (m×p)

Requirement: inner dimensions match (n).
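The shape rule and the entry formula are easy to verify numerically; a NumPy sketch with arbitrary small matrices:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)      # shape (2, 3): m=2, n=3
B = np.arange(12).reshape(3, 4)     # shape (3, 4): n=3, p=4

C = A @ B                           # inner dimensions (3) match
print(C.shape)                      # (2, 4)

# C_ij = sum_k A_ik * B_kj, checked for one entry:
i, j = 1, 2
print(C[i, j] == sum(A[i, k] * B[k, j] for k in range(3)))   # True
```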

3.3 Key properties

  • Associative: (AB)C = A(BC)
  • Distributive: A(B+C)=AB+AC
  • Not commutative (usually): AB \neq BA

3.4 Special matrices

  • Square: n\times n
  • Diagonal: non-zeros only on diagonal
  • Identity: I, diagonal of 1s
  • Inverse: A^{-1} such that AA^{-1} = A^{-1}A = I

A matrix is invertible iff its determinant is non-zero:
\det(A)\neq 0.

3.5 Orthogonal matrices

An orthogonal matrix is a square matrix whose columns are unit vectors and mutually orthogonal.
It satisfies:
A^\top A = I
and therefore:
A^{-1} = A^\top.

Intuition: orthogonal matrices represent rotations (and reflections). They preserve lengths and angles.
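A 2D rotation matrix is the canonical example; a quick NumPy check (the angle π/3 is arbitrary):

```python
import numpy as np

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2D rotation matrix

# Orthogonality: R^T R = I, so R^{-1} = R^T
print(np.allclose(R.T @ R, np.eye(2)))            # True

# Lengths are preserved: ||Rx|| = ||x||
x = np.array([3.0, 4.0])
print(np.isclose(np.linalg.norm(R @ x), 5.0))     # True
```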

3.6 Matrices as linear transformations

A matrix A transforms a vector x into Ax. In 2D and 3D, you can visualize this as stretching, rotating, and shearing space (linear maps keep the origin fixed and keep lines straight):

  • Diagonal matrices: scaling/stretching along axes
  • Orthogonal matrices: rotation (or reflection)
  • General matrices: combinations of rotation + scaling (and possibly shear)

4) Eigenvalues and Eigenvectors: Directions That Don’t Change (Much)

4.1 The core idea

When a matrix transforms space, most vectors change direction.
But some special vectors keep their direction (they may flip sign) and only get scaled.
Those are eigenvectors.

Definition: v is an eigenvector of A with eigenvalue \lambda if:
Av = \lambda v.

Diagram: an eigenvector v keeps its direction under the transformation — Av = λv points along the same line as v, with its length scaled by λ.

4.2 How to compute eigenvalues

Start with:
Av=\lambda v
Rearrange:
(A - \lambda I)v = 0

For a non-zero v to exist, A - \lambda I must be singular:
\det(A - \lambda I)=0.
Solving this equation gives eigenvalues \lambda.
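In practice you rarely solve the determinant equation by hand; a sketch using `numpy.linalg.eig` (the 2×2 matrix here is an arbitrary example whose characteristic polynomial (2−λ)² − 1 = 0 gives λ = 1, 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns eigenvalues and eigenvectors (as columns)
vals, vecs = np.linalg.eig(A)
print(np.allclose(np.sort(vals), [1.0, 3.0]))     # True

# Each column v of vecs satisfies A v = lambda v
v = vecs[:, 0]
print(np.allclose(A @ v, vals[0] * v))            # True
```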

4.3 Real vs complex eigenvalues (and “no real eigenvectors”)

Some transformations (like a pure 2D rotation by any angle other than 0° or 180°) do not have real eigenvectors.
In that case, eigenvalues/eigenvectors exist in the complex numbers (involving i, where i^2=-1).
For many data applications, we often work with matrices (like symmetric covariance matrices) that guarantee real eigenvalues.


5) Eigendecomposition and Diagonalization

5.1 Eigendecomposition

If an n×n matrix has n linearly independent eigenvectors (enough to form a basis), it can be decomposed as:
A = Q \Lambda Q^{-1}
where:

  • Q is a matrix whose columns are eigenvectors
  • \Lambda is a diagonal matrix of eigenvalues

5.2 Symmetric matrices (very important in data)

If A is symmetric (A=A^\top), then it has a particularly nice form:
eigenvectors can be chosen orthonormal, so Q^{-1}=Q^\top, and:
A = Q \Lambda Q^\top.

This shows a clean geometric story: rotate into the eigenvector coordinate system, scale by eigenvalues, rotate back.
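For symmetric matrices, `numpy.linalg.eigh` returns exactly this orthonormal decomposition; a sketch with an arbitrary symmetric 2×2 matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])              # symmetric: A == A.T

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors (columns of Q)
lam, Q = np.linalg.eigh(A)

# Q is orthogonal (Q^T Q = I) and A = Q diag(lam) Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))           # True
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))    # True
```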


6) Singular Value Decomposition (SVD): The Workhorse of Data Science

6.1 The decomposition

Any matrix A (even non-square) can be decomposed as:
A = U S V^\top
where:

  • U: left singular vectors (orthonormal)
  • V: right singular vectors (orthonormal)
  • S: diagonal matrix of singular values (nonnegative)

6.2 Connection to eigenvectors

Singular vectors relate to eigenvectors of:
A^\top A and A A^\top.
The singular values are square roots of eigenvalues of these matrices:
\sigma_i = \sqrt{\lambda_i}.
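This connection can be checked directly; a NumPy sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))        # non-square is fine

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruction: A = U S V^T
print(np.allclose(U @ np.diag(s) @ Vt, A))       # True

# Singular values are square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]      # sort descending to match s
print(np.allclose(s, np.sqrt(eigvals)))          # True
```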

6.3 Geometric meaning: rotate → scale → rotate

Diagram: SVD as rotate → scale → rotate

  • Vᵀ: rotate the input space
  • S: scale along the axes
  • U: rotate the output space

Why SVD matters: it powers dimensionality reduction (PCA), noise filtering, compression, recommendations, and more.


7) Data & Measurement: How Linear Algebra Represents the Real World

7.1 Entities and attributes

In data:

  • Entity = an object/record/instance (e.g., a student, a document, a transaction)
  • Attribute = a property/feature/variable (e.g., GPA, word count, price)

7.2 Discrete vs continuous attributes

  • Discrete: finite/countable (zip code, words, categories)
  • Continuous: real-valued measurements (temperature, height, time)

7.3 Tabular data as a matrix

A dataset is often a matrix X:

  • Rows = entities
  • Columns = attributes

Example: if you have 10,000 students and 20 features each, then X is a 10000\times 20 matrix.

7.4 Document data (term vectors)

A document can be converted into a term vector (bag-of-words): each dimension counts how many times a word appears.
Then documents become vectors, and a collection of documents becomes a matrix.

7.5 Transaction data

Each transaction is a set of items (e.g., products bought together). You can convert this into a binary matrix:

  • Row = transaction
  • Column = product
  • Entry = 1 if product is present, else 0
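Building such a binary matrix is a short exercise; a small sketch (the product names and transactions are made up for illustration):

```python
import numpy as np

products = ["bread", "milk", "eggs", "beer"]
transactions = [{"bread", "milk"}, {"milk", "eggs", "beer"}, {"bread"}]

# Rows = transactions, columns = products, entry = presence (1) or absence (0)
X = np.array([[1 if p in t else 0 for p in products] for t in transactions])
print(X)
# [[1 1 0 0]
#  [0 1 1 1]
#  [1 0 0 0]]
```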

8) Distance Measures: Quantifying “Closeness” Between Data Points

8.1 Points as vectors in Euclidean space

Once you represent entities as vectors, you can measure how similar/different they are by using a distance function.
In geometry terms, each data point is a point in \mathbb{R}^n.

8.2 What makes a distance a “metric”?

A true metric distance d(x,y) satisfies:

  • Positivity: d(x,y)\ge 0, and d(x,y)=0 iff x=y
  • Symmetry: d(x,y)=d(y,x)
  • Triangle inequality: d(x,z)\le d(x,y)+d(y,z)

8.3 Manhattan (L1) distance

d_1(x,y) = \sum_{i=1}^n |x_i - y_i|

Intuition: like navigating city blocks (e.g., NYC streets), moving along axes only.

8.4 Euclidean (L2) distance

d_2(x,y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}

Intuition: straight-line distance “as the crow flies.”

Diagram: Manhattan vs Euclidean distance between points A and B on a grid — Euclidean is the straight line; Manhattan follows axis-aligned steps.

8.5 Weighted Euclidean distance

If some features matter more than others, use weights:
d_w(x,y)=\sqrt{\sum_{i=1}^n w_i (x_i-y_i)^2}.
Larger w_i makes differences in feature i count more.
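The three distances side by side, as a NumPy sketch (the points and the weight vector `w` are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

d1 = np.sum(np.abs(x - y))              # Manhattan: |1-4| + |2-6| = 7
d2 = np.sqrt(np.sum((x - y) ** 2))      # Euclidean: sqrt(9 + 16) = 5

w = np.array([1.0, 4.0])                # make the second feature count more
dw = np.sqrt(np.sum(w * (x - y) ** 2))  # weighted: sqrt(9 + 64)

print(d1, d2)                           # 7.0 5.0
```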


9) Standardization: Fixing the “Units Problem”

Distances can be misleading when features have different scales (e.g., “income” in dollars vs “age” in years).
A common fix is standardization (z-scoring):

For feature x, convert each value to:
z = \frac{x - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation.

After standardizing, each feature has mean 0 and standard deviation 1, so no single feature dominates distances just because it has big numeric units.
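A sketch of z-scoring a small feature matrix (the numbers are made up; note that `std` here is the population standard deviation, and `ddof=1` would give the sample version):

```python
import numpy as np

# Rows = people, columns = (income in dollars, age in years)
X = np.array([[50_000.0, 25.0],
              [80_000.0, 40.0],
              [65_000.0, 31.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma            # z = (x - mu) / sigma, per column

# Each standardized feature has mean 0 and standard deviation 1
print(np.allclose(Z.mean(axis=0), 0.0))   # True
print(np.allclose(Z.std(axis=0), 1.0))    # True
```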


10) Correlation Between Variables and Mahalanobis Distance

10.1 Why Euclidean distance can fail with correlated features

If two attributes are strongly correlated (e.g., “height in inches” and “height in centimeters” or related biology measurements),
Euclidean distance effectively counts the same underlying variation multiple times.
Also, data may be stretched more in one direction than another.

10.2 Mahalanobis distance

Mahalanobis distance accounts for feature scaling and correlation using the covariance matrix \Sigma:
d_M(x,y)=\sqrt{(x-y)^\top \Sigma^{-1} (x-y)}.

Intuition: it measures distance after “whitening” the data—transforming it so correlated directions are handled properly and the cloud becomes more spherical.
In many statistical and ML settings (anomaly detection, multivariate outlier scoring), Mahalanobis distance is a better fit than plain Euclidean distance.
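A minimal sketch, assuming the covariance matrix is estimated from the data itself and is invertible (the synthetic correlated data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Strongly correlated 2D data: second feature nearly duplicates the first
base = rng.standard_normal(500)
X = np.column_stack([base, base + 0.1 * rng.standard_normal(500)])

Sigma = np.cov(X, rowvar=False)     # 2x2 covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, y, Sigma_inv):
    d = x - y
    return np.sqrt(d @ Sigma_inv @ d)

mu = X.mean(axis=0)
print(mahalanobis(X[0], mu, Sigma_inv))   # distance of one point from the mean
```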


11) Distance for Binary Data: Matching, Hamming, and Jaccard

For binary vectors (0/1), it helps to count how two vectors match:

  • n_{11}: count of positions where both are 1
  • n_{10}: count where first is 1 and second is 0
  • n_{01}: count where first is 0 and second is 1
  • n_{00}: count where both are 0

11.1 Matching distance

The simplest idea is to count disagreements over all positions: d(x,y) = (n_{10} + n_{01}) / n, where n is the total number of positions. Variants differ in whether 0–0 matches (n_{00}) are treated as meaningful agreement.

11.2 Hamming distance

Hamming distance counts how many positions differ:
d_H(x,y)=n_{10}+n_{01}.
This is often the simplest and most widely used for fixed-length binary strings.

11.3 Jaccard distance

For sparse binary data (like “did the user click this item?”), shared zeros are usually not informative.
Jaccard similarity focuses on “1” matches:
J(x,y)=\frac{n_{11}}{n_{11}+n_{10}+n_{01}}.
Then Jaccard distance is:
d_J(x,y)=1-J(x,y).
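Both counts and distances are a few lines of NumPy; a sketch with two arbitrary binary vectors:

```python
import numpy as np

x = np.array([1, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1, 0])

n11 = np.sum((x == 1) & (y == 1))   # both 1: positions 0 and 4 -> 2
n10 = np.sum((x == 1) & (y == 0))   # 1 in x, 0 in y: position 1 -> 1
n01 = np.sum((x == 0) & (y == 1))   # 0 in x, 1 in y: position 3 -> 1

hamming = n10 + n01                          # positions that differ
jaccard_sim = n11 / (n11 + n10 + n01)        # 2 / 4
jaccard_dist = 1 - jaccard_sim

print(hamming, jaccard_dist)                 # 2 0.5
```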


12) Summary: How the Pieces Fit Together

  • Vectors represent individual data points (features for an entity).
  • Matrices represent datasets and linear transformations.
  • Eigenvectors/eigenvalues identify special directions that scale under transformation.
  • SVD generalizes this idea for any matrix and is foundational for dimensionality reduction and structure discovery.
  • Distances quantify similarity; choosing the right one depends on data type and geometry.
  • Standardization and Mahalanobis handle scale and correlation, which matter in real datasets.